Ted Laderas
7/9/2017
You will need R and RStudio to run this workshop.
You will need the following packages to do this workshop. Please install these packages using the install.packages() command.
library(tidyverse)
library(shiny)
Cribbed from my twitter bio:
shiny/dplyr to teach EDA
Datasaurus Dozen
gRadual exposuRe can lessen fear…
shiny dashboards to ask key questions of data (EDA)
dplyr
Hadley Wickham’s Data Wrangling Diagram
“Exploratory data analysis can never be the whole story, but nothing else can serve as the foundation stone.” - John Tukey, Exploratory Data Analysis
In contrast to Confirmatory Data Analysis (CDA), such as hypothesis testing, the goals of EDA are to:
EDA is vital when repurposing and reusing data that was collected for another purpose. Don’t go in blind!
Always read the data dictionary if provided! It often contains useful information about how the data is represented, such as the units for each column.
Data Dictionary Example
summary() - Look for ‘strange’ values
table() - Look for confounding among categorical variables
hist() - Highlight outliers
boxplot() - Compare conditional medians and distributions
plot.xy() - Look for correlations among continuous variables
data.frame
Dashboard Image
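As a rough sketch, the base R functions listed above can be tried on a built-in dataset. The choice of mtcars and of these particular columns is an assumption made for illustration; it is not the workshop data. plot() is used here as the usual high-level way to draw a scatterplot.

```r
# Illustrative only: mtcars stands in for the workshop data
data(mtcars)

# summary(): scan the min, max, and quartiles for 'strange' values
mpg_summary <- summary(mtcars$mpg)
print(mpg_summary)

# table(): cross-tabulate two categorical variables to spot confounding
cyl_by_am <- table(cylinders = mtcars$cyl, manual = mtcars$am)
print(cyl_by_am)

# hist(): a long tail or an isolated bar can highlight outliers
hist(mtcars$hp, main = "Horsepower", xlab = "hp")

# boxplot(): compare conditional medians and distributions
boxplot(mpg ~ cyl, data = mtcars, xlab = "cyl", ylab = "mpg")

# plot(): look for correlation between two continuous variables
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg")
```

If one cell of the cross-tabulation is nearly empty, the two categorical variables may be confounded, which is worth noting before any modeling.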
Three commands:
filter() - remove rows according to criteria
select() - select columns by name
mutate() - calculate new column variables by manipulating data
filter() lets you select rows according to a criterion. You can use | (OR) and & (AND) to chain together logical statements.
library(dplyr)
newIris <- iris %>% filter(Species == "setosa" & Sepal.Length > 5)
head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          5.4         3.9          1.7         0.4  setosa
## 3          5.4         3.7          1.5         0.2  setosa
## 4          5.8         4.0          1.2         0.2  setosa
## 5          5.7         4.4          1.5         0.4  setosa
## 6          5.4         3.9          1.3         0.4  setosa
Note that any statement or function that produces a boolean vector (such as is.na(Species)) can be used here.
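A minimal sketch of both points, combining conditions with | and filtering on is.na(). The irisNA copy with an injected missing value is hypothetical and exists only to give is.na() something to find.

```r
library(dplyr)

# OR: keep rows from either of two species
twoSpecies <- iris %>% filter(Species == "setosa" | Species == "virginica")

# Hypothetical copy of iris with one missing value, for illustration only
irisNA <- iris
irisNA$Sepal.Length[1] <- NA

# Any expression that returns a logical vector works, e.g. !is.na()
noMissing <- irisNA %>% filter(!is.na(Sepal.Length))

nrow(twoSpecies)
nrow(noMissing)
```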
select() lets you select columns in your dataset.
Remember: “filter() works on rows, select() works on columns.” - Chester’s Mantra
library(dplyr)
newIris <- iris %>% select(Sepal.Width, Species)
head(newIris)
##   Sepal.Width Species
## 1         3.5  setosa
## 2         3.0  setosa
## 3         3.2  setosa
## 4         3.1  setosa
## 5         3.6  setosa
## 6         3.9  setosa
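Beyond naming columns one by one, select() can also drop columns or match them by name pattern. The following is a small sketch; the variable names noSpecies and sepalOnly are made up for this example.

```r
library(dplyr)

# Drop a column by prefixing its name with a minus sign
noSpecies <- iris %>% select(-Species)

# Keep every column whose name starts with "Sepal"
sepalOnly <- iris %>% select(starts_with("Sepal"))

names(noSpecies)
names(sepalOnly)
```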
mutate() is one of the most useful dplyr commands. You can use it to transform data and add it as a new column into the data.frame:
library(dplyr)
newIris <- iris %>% mutate(sepalSum = Sepal.Length + Sepal.Width)
head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepalSum
## 1          5.1         3.5          1.4         0.2  setosa      8.6
## 2          4.9         3.0          1.4         0.2  setosa      7.9
## 3          4.7         3.2          1.3         0.2  setosa      7.9
## 4          4.6         3.1          1.5         0.2  setosa      7.7
## 5          5.0         3.6          1.4         0.2  setosa      8.6
## 6          5.4         3.9          1.7         0.4  setosa      9.3
#add a column with the same value for each entry
newIris <- newIris %>% mutate(value = "Site1")
head(newIris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species sepalSum value
## 1          5.1         3.5          1.4         0.2  setosa      8.6 Site1
## 2          4.9         3.0          1.4         0.2  setosa      7.9 Site1
## 3          4.7         3.2          1.3         0.2  setosa      7.9 Site1
## 4          4.6         3.1          1.5         0.2  setosa      7.7 Site1
## 5          5.0         3.6          1.4         0.2  setosa      8.6 Site1
## 6          5.4         3.9          1.7         0.4  setosa      9.3 Site1
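mutate() can also create several columns in one call, including a category derived from a numeric column. This is only a sketch: the column names petalRatio and longSepal and the 5.8 cutoff are arbitrary choices for illustration.

```r
library(dplyr)

# Several new columns can be created in a single mutate() call
ratioIris <- iris %>%
  mutate(
    # numeric transformation of two existing columns
    petalRatio = Petal.Length / Petal.Width,
    # derived category; the 5.8 cutoff is arbitrary
    longSepal = ifelse(Sepal.Length > 5.8, "long", "short")
  )

head(ratioIris)
```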
%>%
The power of dplyr comes from the fact that you can chain multiple steps.
Example: Let’s calculate a new column SepalMean on iris and filter the dataset on this new variable.
library(dplyr)
data(iris)
iris2 <- iris %>% mutate(SepalMean = (Sepal.Length + Sepal.Width) / 2) %>%
  filter(SepalMean > 4)
nrow(iris)
## [1] 150
nrow(iris2)
## [1] 123
head(iris2)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species SepalMean
## 1          5.1         3.5          1.4         0.2  setosa      4.30
## 2          5.0         3.6          1.4         0.2  setosa      4.30
## 3          5.4         3.9          1.7         0.4  setosa      4.65
## 4          5.0         3.4          1.5         0.2  setosa      4.20
## 5          5.4         3.7          1.5         0.2  setosa      4.55
## 6          4.8         3.4          1.6         0.2  setosa      4.10
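To see what the pipe buys you, here is a sketch that writes the same three steps twice: once piped, once as nested function calls. The column name petalSum is made up for this example.

```r
library(dplyr)

# Piped version: reads top to bottom, one step per line
piped <- iris %>%
  filter(Species == "virginica") %>%
  select(Species, Petal.Length, Petal.Width) %>%
  mutate(petalSum = Petal.Length + Petal.Width)

# Same steps as nested calls: has to be read from the inside out
nested <- mutate(
  select(
    filter(iris, Species == "virginica"),
    Species, Petal.Length, Petal.Width
  ),
  petalSum = Petal.Length + Petal.Width
)

# Compare the two results
all.equal(piped, nested)
```

Each %>% passes the result on its left as the first argument of the function on its right, which is why the two versions describe the same computation.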
For the workshop, you’ll use the Shiny App to do some EDA. You will need to be familiar with the basic architecture of the app.
global.R - This is where you’ll do your work - place filtering and processing steps here. Any objects loaded here can be seen by both the ui.R and server.R.
Shiny Architecture
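The division of labor can be sketched in a single file. This is not the workshop app's code; it is a minimal stand-in, assuming iris as the data, with made-up input and output ids (var, histPlot), that shows which part would live in which file.

```r
library(shiny)

# --- would live in global.R: data preparation shared by ui and server ---
dataset <- iris
numericVars <- names(dataset)[sapply(dataset, is.numeric)]

# --- would live in ui.R: page layout and input controls ---
ui <- fluidPage(
  selectInput("var", "Variable", choices = numericVars),
  plotOutput("histPlot")
)

# --- would live in server.R: code that reacts to the inputs ---
server <- function(input, output) {
  output$histPlot <- renderPlot({
    hist(dataset[[input$var]], main = input$var, xlab = input$var)
  })
}

# Build the app object; call runApp(app) in an interactive session to launch it
app <- shinyApp(ui = ui, server = server)
```

Because dataset and numericVars are created before ui and server, both can refer to them, which mirrors how objects from global.R are visible to ui.R and server.R.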
An experimental weight loss drug was first tested at one site with volunteers (DatasetA). Given the small sample size, volunteers from an additional site were recruited (DatasetB). Read the data dictionaries!
Go as far as you can. Remember to use your post-its to show your status and whether you need help.
EDA is a puzzle with real-world consequences. Use your tools to understand the data!
Get the workshop materials here:
git clone http://github.com/laderast/shinyEDA
In the shinyEDA/ folder, open up the .Rproj file.
Datasets are in the data/ folder along with the data dictionaries and readmes.
Read weightLossAssignment.pdf for more details.
For each issue with the data:
You can load your own datasets for exploration and cleaning - just assign them the name dataset.
The dashboard tries to detect which variables are numeric (numeric) and which are categorical (ordered, factor), so you may need to set the type of each variable yourself.
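A sketch of what setting the types might look like, assuming mtcars as a stand-in for your own data. Which columns to convert is a judgment call for each dataset; in practice you would likely read your own file first.

```r
library(dplyr)

# mtcars stands in for your own data; the object must be called `dataset`
dataset <- mtcars %>%
  mutate(
    cyl  = factor(cyl),       # unordered categorical
    gear = ordered(gear),     # ordered categorical
    am   = factor(am, labels = c("automatic", "manual"))  # 0/1 recoded
  )

# Check how each column is now typed
sapply(dataset, function(x) class(x)[1])
```

Columns left untouched, such as mpg, stay numeric.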
Cardiovascular Risk Prediction Workshop:
https://github.com/laderast/cvdNight1
Many more workshops in the future!
This work was funded by a Big Data to Knowledge (BD2K) T25 Grant: 1R25EB020379-01
Feel free to fork the Shiny app for your own purposes. It’s designed to be a simple introduction to Shiny as well.